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Abstract 

We address the problem of model selection for Support Vector 
Machine (SVM) classification. For fixed functional form of the kernel, 
model selection amounts to tuning kernel parameters and the slack 
penalty coefficient C. We begin by reviewing a recently developed 
probabilistic framework for SVM classification. An extension to the 
case of SVMs with quadratic slack penalties is given and a simple 
approximation for the evidence is derived, which can be used as a cri- 
terion for model selection. We also derive the exact gradients of the 
evidence in terms of posterior averages and describe how they can be 
estimated numerically using Hybrid Monte Carlo techniques. Though 
computationally demanding, the resulting gradient ascent algorithm 
is a useful baseline tool for probabilistic SVM model selection, since it 
can locate maxima of the exact (unapproximated) evidence. We then 
perform extensive experiments on several benchmark data sets. The 
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aim of these experiments is to compare the performance of probabilis- 
tic model selection criteria with alternatives based on estimates of the 
test error, namely the so-called "span estimate" and Wahba's Gener- 
alized Approximate Cross- Validation (GACV) error. We find that all 
the "simple" model criteria (Laplace evidence approximations, and 
the Span and GACV error estimates) exhibit multiple local optima 
with respect to the hyperparameters. While some of these give per- 
formance that is competitive with results from other approaches in 
the literature, a significant fraction lead to rather higher test errors. 
The results for the evidence gradient ascent method show that also 
the exact evidence exhibits local optima, but these give test errors 
which are much less variable and also consistently lower than for the 
simpler model selection criteria. 

Keywords: Support Vector Machines, model selection, probabilistic 
methods, Bayesian evidence 

1 Introduction 

Support Vector Machines (SVMs) have emerged in recent years as powerful 
techniques both for regression and classification. One of the central open 
questions is model selection: how does one tune the parameters of the SVM 
algorithm to achieve optimal generalization performance? We focus on the 
case of SVM classification, where these "hyperparameters" include any pa- 
rameters appearing in the SVM kernel, as well as the penalty parameter C 
for violations of the margin constraint. 

Our aim in this paper is twofold. First, we extend our work on proba- 
bilistic methods for SVMs to the case of quadratic slack penalties; we also 
develop a "baseline" algorithm which can be used to find in principle exact 
maxima of the evidence. Second, we perform numerical experiments on a se- 
lection benchmark data sets to compare the model selection criteria derived 
from the probabilistic view of SVMs with alternatives that directly try to 
optimize estimates of test error. Our focus in these experiments is less on 
computational efficiency, but rather on the relative merits of the methods in 
terms of the resulting generalization performance. 

We begin in Sec. |2] with a brief review of SVM classification and of its 
probabilistic interpretation; the setup will be such that the extension of the 
probabilistic point of view to the quadratic penalty case requires only small 
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changes compared to linear penalty SVMs. In Sec. |3| we review some criteria 
for model selection that have been proposed based on approximations to the 
test error. We also describe previous approximations to the evidence for 
the linear penalty SVM, and then give an analogue for quadratic penalty 
SVMs. Exact expressions for gradients of the evidence with respect to the 
hyperparameters are then derived in terms of averages over the posterior. 
Sec. [| has a description of the methods we use in our numerical experiments 
on model selection, including the Hybrid Monte Carlo algorithm which we use 
to calculate evidence gradients numerically. The results of our experiments 
on benchmark data sets are discussed in Sec. [|; we conclude in Sec. |] with a 
summary and an outlook towards future work. 

2 SVM classification 

In this section, we give a very brief review of SVM classification; for details 
the reader is referred to recent textbooks or review articles such as [|I], 0]. 
We also sketch the probabilistic interpretation of SVMs, from which we later 
obtain Bayesian criteria for SVM model selection. 

Suppose we are given a set D of n training examples (xi, r/i) with binary 
outputs yi = ±1 corresponding to the two classes. The basic SVM idea 
is to map the inputs x to vectors <f>(x) in some high- dimensional feature 
space; ideally, in this feature space, the problem should be linearly separable. 
Suppose first that this is true. Among all decision hyperplanes w-<p(x)+b = 
which separate the training examples (i.e. which obey yi(w-<f)(xi) + b) > for 
all Xi G X, X being the set of training inputs), the SVM solution is chosen 
as the one with the largest margin, i.e. the largest minimal distance from 
any of the training examples. Equivalently, one specifies the margin to be 
equal to 1 and minimizes the squared length of the weight vector ||w|| 2 ||, 
subject to the constraint that yi(w ■ 4>(xi) + b) > 1 for all i. The quantities 
yi(w-(f)(xi)+b) are again called margins, although for an unnormalized weight 
vector they no longer represent geometrical distances |EJ. This leads to the 
following optimization problem: Find a weight vector w and an offset b such 
that |||w|| 2 is minimized, subject to the constraint that yi(w ■ 4>{xi) +b) > 1 
for all training examples. 

If the problem is not linearly separable, or if one wants to avoid fitting 
noise in the training data, 'slack variables' £j > are introduced which 
measure how much the margin constraints are violated; one thus writes yj(w- 
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4>(xi) +b) > 1 — £j. To control the amount of slack allowed, a penalty term 
{C/p)J2i£i i s then added to the objective function |||w|| 2 , with a penalty 
coefficient C. Common values for the exponent parameter are p = 1 and 
p = 2, giving linear and quadratic slack penalties respectively Training 
examples with ?/j(w ■ <p(xi) + b) > 1 (and hence £j = 0) incur no penalty; 
the others contribute (C/p)[l — ?/i(w ■ <fi(xi) + b)] p each. This gives the SVM 
optimization problem: Find w and b to minimize 
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CEiMw-Mxj + b]) (i) 



where ^(z) is the loss function 

l p {z)= l -{l-zYH{l-z) (2) 

The Heaviside step function H(l — z) (defined as H(a) = 1 for a > and 
if(a) = otherwise) ensures that this is zero for z > 1. For p = 1, l p (z) is 
called (shifted) hinge loss or soft margin loss. 

In the following we modify the basic SVM problem by adding the quadratic 
term ^b 2 /B 2 to (|1|), thus introducing a penalty for large offsets b. A discus- 
sion of why this is reasonable, certainly within a probabilistic view, can be 
found in 0; at any rate the standard formulation can always be retrieved 
by making the constant B large. We can now define an augmented weight 
vector w = (b/B, w) and augmented feature space vectors <f>(x) = (B, 4>(x)) 
so that the modified SVM problem is to find a w which minimizes 

-\\w\\ 2 + cJ2 l p{yi™ ■ 4>( x i)) ( 3 ) 

This statement of the problem is useful for the probabilistic interpretation of 
SVM classification, of which more shortly. For a practical solution, one uses 
Lagrange multipliers ctj conjugate to the constraints |/jW • <f)(xi) > 1 — & and 
finds in the standard way (see e.g. 0) that the optimal (augmented) weight 
vector is w* = J2iUi a i4>( x i)- F° r the linear penalty case p — 1, the aj are 
found from 

™* c ( E a i ~ \ E n,n A', ; (4) 

- a% - \ % i,j J 

Here = K(x,i the elements of the Gram matrix K, obtained by 

evaluating the kernel K(x, x') = 4>(x) ■ 4>(x') = 4>(x) ■ 4>(x') + B 2 for all 
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pairs of training inputs. The corresponding optimal latent or discrimination 
function is 6\x) = w* • <p(x) = J2iUi a iK(x,Xi). Only the with a i: > 
contribute to this sum; these are called support vectors (SVs). SVs fall into 
two groups: If a« < C, one has yfi* = Ui9*(xi) = 1; we will call these the 
"marginal SVs" because their margins are exactly at the allowed limit where 
no slack penalty is yet incurred. For = C, on the other hand, yiQ* < 1, 
and these "hard SVs" are the points at which the slack penalty is active. 
Non-SVs have large margins, yi8* > 1. 

For the quadratic penalty case p = 2, the a» are obtained as the solution 
of (see e.g. 0) 

max | OLi - ^Yl a i a jViyj K ij ) ( 5 ) 

where Kf- = + C~ 1 5ij. Apart from the replacement of K by K , this 
maximization problem is the same as (f|) for the linear penalty case in the 
limit C — > oo where no violations of the margin constraints are allowed. 
There is now only one kind of SV, identified by on > 0. It follows by dif- 
ferentiating (|5|) that for a SV one has J2j a jVjKij — 1- The margin for a 
SV is thus Di6* = Hi J2j a jUjKij — 1 — &i/C, so that all SVs incur a nonzero 
slack penalty. Non-SVs again have yi9* > 1. 

We now turn to the probabilistic interpretation of SVM classification (see 
Refs. HH, £|, P and the works quoted below). The aim of such an interpretation 
is to allow the application of Bayesian methods to SVMs, without modify- 
ing the basic SVM algorithm which already has a large user community. 
(An alternative philosophy would be to consider similar inference algorithms 
which share some of the benefits of SVMs but are constructed directly from 
probabilistic models; Tipping's Relevance Vector Machine is a success- 
ful example of this.) One regards @ as defining a negative log-posterior 
probability for the parameters w of the SVM, given a training set D. The 
conventional SVM classifier is then interpreted as the maximum a posteriori 
(MAP) solution of the corresponding probabilistic inference problem. The 
first term in @ gives the prior Q(w) oc exp(— 1| |w| | 2 ). This is a Gaussian 
prior on w; the components of w are uncorrelated with each other and have 
unit variance. Because only the 'latent function' values 6{x) = w • 0(x)— 
rather than w itself — appear in the second, data dependent term of @, it 
makes sense to express the prior directly as a distribution over these. The 
6{x) have a joint Gaussian distribution because the components of w do, 
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with covariances given by 

(e(x)e(x')) = ({4>{x) ■ w)(w ■ 4>{x'))\ = k(x, x') 



The SVM prior is therefore simply a Gaussian process (GP) over the functions 
9, with zero mean and with the kernel K(x, x') as covariance function. This 
link between SVMs and GPs has been pointed out by a number of authors, 
e.g. 0, ||, 0- It can be understood from the common link to reproducing 



kernel Hilbert spaces [ 10 1 , and can be extended from SVMs to more general 



kernel methods [11]. For connections to regularization operators see also [12 



A nice introduction to inference with Gaussian processes can be found in 



Ref. [HI 



The second term in (||) similarly becomes a (negative) log-likelihood if we 
define the probability of obtaining output y for a given x (and 9) as 

Q(y=±l\x,9) = K(C)exp[-Cl p (y6(x))} (6) 

The constant factor k(C) is determined from /c -1 (C) = max^e"^' 2 ^ + 
e -cip(-z)y £ ensure that Y /y =±i Q(y\x, 9) < 1. In the linear penalty case this 
gives k(C) = l/[l+exp(— 2C)]; for the quadratic penalty SVM, the maximum 
in the definition of k~ 1 (C) is assumed at a value of 9 obeying 9 = tanh(C#) 
and so k(C) can easily be found numerically. The likelihood for the complete 
data set (more precisely, for the training outputs Y — (y± . . . y n ) given the 
training inputs X) is then 

Q(Y\X,9) = HQ(y i \x i ,9) 

i 

With these definitions eq. ([3]) is, up to unimportant constants, equal to the 
log-posteriorQ 

In Q(9\X, Y) = ~ y £ 6(x)K-\x, x') 9{x') - C ^ W^)) + const (7) 

By construction, the maximum of 9\x) gives the conventional SVM classifier, 
and this is easily verified explicitly 0. 

1 In (0) the unrestricted sum over x runs over all possible inputs, and K~\x, x') are the 
elements of the inverse of K(x, x 1 ), viewed as a matrix. We assume here that the input 
domain is discrete. This avoids mathematical subtleties with the definition of determinants 
and inverses of operators (rather than matrices), while maintaining a scenario that is 
sufficiently general for all practical purposes. 
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The probabilistic model defined above is not normalized, since J2 y =±i Q(y\ x , 
1 for generic values of 9{x). The implications of this have been previously 
discussed in detail ||. The normalization of the model is important for the 
theoretical justification of tuning hyperparameters via maximization of the 
data likelihood or "evidence". Nevertheless, experiments in |§ showed that 
promising results for hyperparameter optimization could be obtained also 
with the unnormalized version of the model. We will therefore, in common 
with other work on probabilistic interpretations of SVMs |7[ || || [ 
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disregard the normalization issue from now on. We will also focus on SVM 
classifiers constructed from radial basis function (RBF) kernels 



K(x, x') = k exp 



E 



(X 



X 



ia\2 



2P a 



+ k 



off 



(8) 



where the x a are the different input components, ko is the kernel amplitude 
and k Q ff the kernel offset; k s corresponds to the term B 2 discussed above that 
arises by incorporating the offset b into the kernel. Each input dimension has 
associated with it a length scale l a . Since in the probabilistic interpretation 
K(x,x') is the prior covariance function of the latent function 6(x), each l a 
determines the distance in the x a -direction over which 9{x) is approximately 
constant; large l a correspond to an input component of little relevance (see 
e.g. [0). 



< 



3 Model selection criteria 

3.1 Error bounds and approximations 

Model selection aims to tune the hyperparameters of SVM classification (the 
penalty parameter C and any kernel parameters) in order to achieve the 
lowest test error e, i.e. the lowest probability of misclassification of unseen 
test examples. The test error is not observable directly, and so one is lead to 
use bounds or approximations as model selection criteria. The simplest such 
bounds |T7], [TB[ which have been applied as model selection criteria |0], [2^ 
are expressed in terms of the quantity 

0) 

i 

Here R is the radius of the smallest ball in feature space containing all training 
examples, while Y^i a i can be shown to equal the inverse square of the distance 
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between the separating hyperplane and the closest training points. For RBF 
kernels, R is bounded by a constant since every input point has the same 
squared distance <j>(x) ■ <j>(x) = K(x, x) from the origin. 

More recent work has shown that better bounds and approximations can 
be obtained for the leave-out-out error ei O0 . If 6 l (x) is the latent function 
obtained by training the SVM classifier on the data set with example yj) 
left out, then ei OQ is the probability of misclassification of the left-out example 
if this procedure is applied to each data point in turn, 

eioo = -E#(-^) ( 10 ) 

n i 

where we have abbreviated 6\ = 6 l (xi). Averaged over data sets this is an 
unbiased estimate of the average test error that is obtained from training sets 
of n — 1 examples. This says nothing about the variance of this estimate; 
nevertheless, one may hope that ei OQ is a reasonable proxy for the test error 
that one wishes to optimize. (This is in contrast to the training error, i.e. 
the fraction of all n training examples misclassified when training on the 
complete data set, which is general a strongly biased estimate of test error.) 
For large data sets, ei 00 is time-consuming to compute and one is driven to 
look for cheaper bounds or approximations. Since removing non-SVs from 
the data set does not change the SVM classifier, a trivial bound on ei OQ is the 
sum of the training error and the fraction of support vectors, both obtained 
when training on all n examples. To get better bounds, one writes 



eioc 

n 



which shows that an upper bound on yi[6* — 9]] will give an upper bound 
on eioo- Jaakkola and Haussler proved a bound of this form, yi[9* — 6>*] < 
ctiKa] as before, the on are those obtained from training on the full data 



set. More sophisticated bounds were given by Chappelle and Vapnik |2T 
p2| in terms of what they called the "span". We focus on the case of 
quadratic penalty SVMs, where the span estimates are simplest to state. In 



the simplified version of Ref . , and adapting to our formulation with the 
offset b incorporated into the kernel, the span Si for a support vector can be 
defined as 

S? = imn£W$ (11) 
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where the minimum is over all A = (Ai . . . A n ) with Aj = —1, and Xj = 
whenever aj = 0. With this definition, one can calculate Vi[0* — #*] exactly 
under the assumption that dropping the point X{ from the training set leaves 
the "SV set" unchanged, in the sense that no new SVs arise in the new 
classifier and that all old SVs remain. One thus finds y,i[6* — = ai(Sf — 
1/C). Sf can also be worked out explicitly as Sf = 1/[(K SV + I/C) -1 ]^, 
where Ksv is the Gram matrix K restricted to the SVs and I is the unit 
matrix. (The same result was obtained by Opper and Winther || using a 
slightly different approach.) Using finally that y$* = 1 — a^jC for quadratic 
penalty SVMs, one thus has 



-span 

n 



Wh (aifif - l) , Sf = 1/[(K SV + l/C)-% (12) 



This is only an approximation because the assumption of an unchanged SV 
set will not hold for every SV removed from the training set. 

The span estimate (|1^) of leave-one-out error has the undesirable prop- 
erty of being discontinuous as hyperparameters are varied, making numeri- 
cal optimization difficult. The discontinuity arises from the discontinuity in 
the Heaviside step function H, and from the fact that the size of the ma- 
trix Ksv changes as training examples enter or leave the set of SVs. To 
get around this one can approximate H(z) by a sigmoidal function 



1/[1 + exp(— c\x + C2)] and smooth the span by adding a penalty that forces 
any nonzero Aj to go to zero when aj — > 0. This gives the modified span 
definition 



with the minimum taken over the same A as in (pT|). Explicitly, one finds 

S 2 = 1 V_ 

1 [(Ksv + I/C + ^Agv)- 1 ],, a, 

where Agy is the diagonal matrix containing the nonzero a^. This is easily 
seen to be continuous even when the set of SVs changes as hyperparameters 
are varied. For 77 — ^ one recovers the original span definition (|TT|); for 
i] — > 00, on the other hand, Sf = Ku + 1/C and one recovers the 

Jaakkola and Haussler bound. Overall, the smoothed span estimate for ei OQ 
contains three smoothing parameters c\, C2 and 77. 
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For linear penalty SVMs, Wahba [|IIJ considered a modified version of 
eioo, obtained by replacing the Heaviside step function H(—z) in (|10D by the 
hinge loss h(z) = (1 — z)H(l — z); since h(z) > H{—z), this actually gives 
an upper bound on ei 00 - Wahba's GACV (generalized approximate cross- 
validation) estimate for this modified ei 00 is 

e g acv = - E + ctiKufivie*)) (13) 

n i 

where 

r 2 x < -i 

f(z) = I 1 -1 < X < 1 

[ x > 1 

The first term in the sum in ( |T3D would just give the naive estimate of the 
(modified) ei OQ from the performance on the training set; the second term 
effectively corrects for the bias in this estimate. Because of the nature of the 
function /, e gacv can exhibit discontinuities as hyperparameters are varied 
and this has to be taken into account when minimizing it numerically. 



3.2 Approximations to the evidence 

From a probabilistic point of view, it is natural to tune hyperparameters 
to maximize the likelihood of the data Q(Y\X). By definition, Q(Y\X) = 
j d9 Q(Y\X, 6)Q{6) where the integration is over the values 9{x) of the latent 
function 9 at all different input points x. The likelihood Q(Y\X, 6) only 
depends on the values 9i = 9(xi) of 9 at the training inputs; all other 9(x) 
can be integrated out trivially, so that 



Q(Y\X) = J dOQ(Y\X, 0)Q(0) 

where now the integral is over the n-dimensional vector = (9\ . . . 9 n ). Be- 
cause Q(9) is a zero mean Gaussian process, the marginal Q(0) is a zero mean 
Gaussian distribution with covariance matrix K. The evidence is therefore 

Q(Y\X) = |2vrK|- 1/2 /t n (C) ( dO exp --0 T K -1 - ^Ci„W) (14) 

This n-dimensional integral is in general impossible to carry out exactly. But 
it can be approximated by expanding the exponent around its maximum 
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the solution of the SVM optimization problem. For the linear penalty 
case, this requires some care: because of the kink in the hinge loss l\(z), 
the 6i corresponding to marginal SVs have to be treated separately. The 
result for the normalized log-evidence, suitably smoothed to avoid spurious 
singularities, is || 

E(Y\X) = -\nQ(Y\X) = -\nQ*{Y\X) - -L lndet(I + L m K m ) (15) 
n n In 

where K m is the sub-matrix of the Gram matrix corresponding to the marginal 
SVs, L m is the diagonal matrix with entries 2n[a.i(C — ai)/C] 2 and 

-\nQ*(Y\X) = - J- £0,1*0; - -SM +In«(C) (16) 



The approximation fll5|) is computationally efficient because it only involves 
calculating the determinant of a single matrix of size equal to the number 



of marginal SVs. A related approximation was proposed by Kwok ||15|| . He 
suggested to smooth the kink in the hinge loss by using a sigmoidal ap- 
proximation for the Heaviside function, giving h(z) ~ s(z) = (1 — z)/[l + 
exp[— c(l — z)\ with a smoothing parameter c. This has the disadvantage 
that the SVM solution 9* is no longer a maximum of the smoothed posterior. 
The analogous result to (|I5|) also involves, instead of K m and L m , the whole 
Gram matrix and a diagonal matrix with entries Cs"(yi8*), respectively. One 
thus needs either to evaluate a large determinant of size n, or — somewhat 
arbitrarily — to truncate small values of s"(yi9*) to zero to reduce the size of 
the problem. We therefore do not consider this approach further. 

An approximation similar to (|T^) can easily be derived for the quadratic 
penalty case. The loss function h(z) now has a continuous first derivative and 
so all 8i can be treated on the same footing. The derivation of the Laplace 
approximation is thus standard, and involves the Hessian of the log-likelihood 
at the maximum 0*; the resulting approximation to the (normalized log-) 
evidence is 

E(Y\X) = -\nQ*{Y\X) - ±- lndet(I + M sv K S v) (17) 
n In 

where M sv is a diagonal matrix containing the second derivatives Cl'^yiO*) 
of the loss function evaluated for all the SVs. The calculation of E(Y\X) 
according to ( |TTD requires only the determinant of a matrix whose size is 



11 



the number of SVs, again expected to be manageably small. The matrix 
Msv as defined above is just a multiple of the unit matrix, since I'^iz) = 
H{1 — z) and z = yi9* < 1 for SVs. However, the step discontinuity in I'^z) 
at z — 1 has the undesirable consequence that the Laplace approximation 
to the evidence will jump disco ntinuously when one or several of the q.{ 
reach zero as hyperparameters are varied. We therefore smooth the result by 
replacing I'^iz) = H(l — z) in the definition of Msv by the approximation 
l'z(z) ~ exp[— a/ (1 — z)} for z < 1 (and for z > 1). This is smooth at z = 1 
and also has continuous derivatives of all orders at this point. The value of a 
determines the range of values of z = yiQi around 1 for which the smoothing 
is significant, with a — > recovering the Heaviside step function. 



3.3 Evidence gradients 

Beyond the relatively simple approximations to the evidence derived above, 
it is difficult to obtain accurate numerical estimates of the evidence. This is 
a well-known general problem: while averages over probability distributions 
are straightforward to obtain, normalization constants for such distributions 
- such as the evidence, which is the normalization factor for the posterior - re- 



quire much greater numerical effort (see e.g. [j23|). To avoid this problem, one 
can estimate the gradients of the evidence with respect to the hyperparam- 
eters and use these in a gradient ascent algorithm, without ever calculating 
the value of the evidence itself. As we show in this section, these gradients 
can be expressed as averages over the posterior distribution, which one can 



then estimate by sampling as explained in Sec. [O. 

Starting from eq. (|T4|) we can find the derivative of the normalized log- 
evidence E(Y\X) = n -1 In Q(Y \X) w.r.t. the penalty (or noise) parameter 
C: 

A E (Y\X) dln^C) _ J dO Q{6) gj Zgfcgj) exp [-C 

dC 11 } " dC fdeQ^expl-CEiWi)} 

where the average is, as expected, over the posterior Q(6\D) oc Q(Y\X, 6)Q{6). 
Similarly, the derivative of the log-evidence w.r.t. any parameter A appearing 



12 



in the kernel is 



d_ 
dX 



E(Y\X) 



l_d_ 



1 

tr 

2n 



ln|27rK| - 



JdO±0' r £.K- 1 6exp 


-\0 T K- l O-Y,iCl p {yA) 


J dO exp 


-\e^-^e-Y. l ci p {y l e i )_ 





'9K 
~dX 



K 



-i 



2n\ dX 



--tr f^K-Vl-^K" 1 

2n \ dX \ 



Numerical evaluation of this expression as it stands would be unwise, since 
the difference (i - 00 T K- 1 ) can be much smaller than the two contributions 
individually; in fact, for n = we know that it is exactly zero. It is better 
to rewrite (|T9|), using the fact that the elements of the matrix I — 60 T K. 1 
can be obtained as 



<% — 0i(K -1 0) 



exp 



_d_ 

W 3 



^exp 



The posterior average can thus be worked out using integration by parts, 
giving 

— Gi{Kr l 6)^i = (Cl' p (yj9j)yj0i 

If we define the matrix Y as the diagonal matrix with entries yi so that 
(Y6)i = yi9i, and denote by l' p (Y6) the vector with entries l' p (yi9i), then this 
can be written in the compact form 



I - ^K- 1 ) = C [0[l' p (YO)} T Y] 
Combining this with equation (|l9|) , one has finally 



9 E(Y\X) = -^([l p (YO)FY^K-iO 



dX 



dX 



(20) 



This expression appears to require the inverse K 1 of the Gram matrix, 
which for large n would be computationally expensive to evaluate; however, 



as described in Sec. |4.2j the sampling from the posterior using Hybrid Monte 
Carlo can be arranged so that samples of both and K~ x are obtained 
without requiring explicit matrix inversions. 



(19) 
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4 Numerical methods 



In Sec. [| below, we report results from SVM model selection experiments 
using the criteria described above. Specifically, for linear penalty SVMs we 
compare maximizing the Laplace approximation to the evidence E(Y\X), 
eq. fll5l) , with minimizing Wahba's e gacv , eq. (p~3|) ; for quadratic penalty SVMs 
we again use the relevant approximation to the evidence, eq. (|T7|), and con- 
trast with minimization of the span estimate e spa n of the leave-one-out error, 
eq. (|T2|). These four model selection criteria are "simple" in the sense that 
they can be evaluated explicitly at moderate computational cost. In order 
to be able to compare the different criteria directly, and because one of them 
(egacv) has possible discontinuities as a function of the hyperparameters, we 
use a simple greedy random walk algorithm for optimization that is described 
in Sec. |4.1| . For the other three criteria, more efficient gradient-based opti- 
mization algorithms can be designed [2^] but since our focus here is not on 
computational efficiency we do not consider these. 

For linear penalty SVMs we also studied evidence optimization using 
numerical estimates of the evidence gradients ( |T8|J20| ). The Monte Carlo 
method used to perform the necessary averages over the posterior is outlined 
in Sec. OL while Sec. (O] describes the details of the gradient ascent algo- 
rithm. Note that our use of evidence gradients provides a baseline for model 
selection methods based on approximations to the evidence since it locates, 
up to small statistical errors from the Monte Carlo sampling of posterior 
averages, a local maximum of the exact evidence. 

In all experiments using approximations to the test error (e span and e gacv ) 
as model selection criteria, the hyperparameters being optimized were the pa- 
rameters of the RBF kernel (^|), i.e. the amplitudes ko, k Q Q and the logarithms 
of the length scales l a (one per input dimension). The penalty parameter C 
can be fixed to e.g. C = 1 since these criteria depend only on the proper- 
ties of the SVM solution (i.e. the maximum of the posterior, rather than 
the whole posterior distribution). This SVM solution only depends on the 
product CK(x, x') rather than C and the kernel individually, as one easily 
sees from (HJ5|) or equivalently from (|7]). In contrast, the evidence ([14]) takes 
into account both the position of the posterior maximum and the shape of 
the posterior distribution around this maximum; the latter does depend on 
C . We therefore include C as a hyperparameter to be optimized in evi- 
dence maximization. The SVM predictor of the final selected model will of 
course again be dependent only on the product CK(x, x'); but the value of 
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C itself would be important e.g. for the determination of predictive class 
probabilities. This issue, which we do not pursue here, is discussed in detail 
in Ref. §. 

4.1 Optimization of "simple" model selection criteria 

We optimized the simple model selection criteria using a simple greedy ran- 
dom walk, or "zero temperature Monte Carlo" search. This is a simple adap- 
tation of the common Metropolis Algorithm (see e.g. W% ) used to sample 
from a probability distribution; in the zero temperature limit the algorithm 
reduces to repeatedly adding a small step (which we take to be Gaussian) to 
each parameter, recalculating the quantity being optimized, and moving to 
the new point if and only if the new point yields a better value (higher for 
the evidence, and lower for error estimates). The randomness in the algo- 
rithm may appear disadvantageous in terms of computational efficiency; but 
for our purposes, it is actually helpful since it allows us to assess whether 
the model selection criteria in question have a number of local optima or a 
single (global) optimum. It also made further randomization over the initial 
hyperparameter values unnecessary, and so the experiments with the simple 
model selection criteria were all started with a fixed set of initial values for 
the SVM hyperparameters. A few preliminary trials were used to choose ini- 
tial values with an appropriate order of magnitude, and all results reported 
were initialized with the hyperparameters C = 1, l a = 1 for all length scales, 
ko — 1 and k Q s = 0.1. 

The span error estimate and the Laplace evidence for quadratic loss each 
have additional smoothing parameters that had to be selected (ci, c 2 and rj 
in the span estimate, a for the Laplace evidence; see Sees. |37[] and |T2] respec- 
tively). Appropriate values for these parameters were found by a simple (log) 
line search in the parameter values. These tests were not done extensively, 
but the results for SVM model selection did not seem to depend strongly on 
the values of these parameters as long as they were of a reasonable order of 
magnitude. For all tests presented here the values used were c\ = 5, c 2 = 0, 
r\ — 1 for the span estimate, and a = 0.1 for the Laplace evidence with 
quadratic loss. 

In the greedy random walk algorithm, the step size used for each hy- 
perparameter was adapted separately by measuring the acceptance rate for 
proposed changes in the parameter and scaling the step size up or down 
to keep the acceptance rate close to 50%. Thus a decreasing step size can 
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be taken as one measure of how well the process has converged to an opti- 
mum. The search is terminated when either the step size has become very 
small, or the change to the criterion being optimized becomes very small. It 
was also found during experimentation that a useful addition to the basic 
algorithm was to enforce minimum and maximum values of the hyperparam- 
eters. Without such bounds the algorithm would occasionally get "stuck" 
in a plateau region of the model selection criterion where one or more hy- 
perparameters were either very large or very small. Note that for the kernel 
hyperparameters steps in the random walk were taken in the natural loga- 
rithm of the hyperparameter values, as these scale parameters were expected 
to show a significant range of variation. Steps for C were taken in a linear 
scale, reflecting the smaller range of variation. 

4.2 Estimating evidence gradients 

We used Hybrid Monte Carlo (HMC, see e.g. ||23|| ) to estimate the posterior 
averages required in the expressions and (^0|) for the exact evidence gra- 
dients. The HMC algorithm is a standard technique from statistical physics 
that works by simulating a stochastic dynamics with a Hamiltonian "energy" 
defined by the target distribution plus a "momentum", or kinetic energy 
term. Denoting the momentum variables p, the Hamiltonian we choose for 
our case is 

H(0, p) = ^ P T Kp + Vk- x + V(0), V{9) =CY, l P {vA) (21) 

and the corresponding "Boltzmann" distribution P(9, p) cx exp[— 7i(6, p)] oc 
exp(|p T Kp)<5(#|-D) factorizes over 6 and p, so that samples from Q(6\D) 
can be obtained by sampling from P(6, p) and discarding the momenta p. 
The p are nevertheless important for the algorithm, since they help to en- 
sure a representative sampling of the posterior. An update step in the HMC 
algorithm consists of two parts. First, one updates a randomly chosen mo- 
mentum variable Pi by Gibbs sampling according to the Gaussian distribu- 
tion exp(— |p T Kp); this will in general change the value of the Hamiltonian. 
Second, one changes both 6 and p by moving along a Hamiltonian trajec- 
tory for some specified "time" r; the trajectory is determined by solving an 
appropriately discretized version of the differential equations 

= wr (Kph (22) 
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1^ ~ -de--~ {K e)i ~~de~ (23) 

For an exact solution of these equations, 7i would remain constant; due to the 
discretization, small changes in TC are possible and one accepts the update 
of and p from the beginning to the end of the trajectory with the usual 
Metropolis acceptance rule. Iterating these steps the algorithm will, after 
some initial equilibration period, produce samples from P(0,p). 

The occurrence of K™ 1 in (p3j ) is inconvenient. We circumvent this by 
introducing 6 = K _1 0; 6 is initialized to the SVM solution 6*, since then the 
corresponding 6 is obtained trivially as Q\ = yiai without requiring matrix 
inversions. The Hamiltonian equations (^,^) simplify to 

fa _~ dV(0) 
dr 1 dOi 

and the simple form of the first equation is in fact what motivated our choice 
of the momentum-dependent part of H, eq. (|21|). The correspondence be- 
tween 6 and 6 is maintained by updating 6 = K.6 whenever 6 is changed. As 
a by-product, we automatically obtain samples of K -1 as required for (pop. 

Averages over the posterior distribution are taken by sampling after each 
trajectory step, repeating the procedure over some large number of steps. In 
practice usually the first half of the steps are discarded to allow for equi- 
libration. We chose a total of 40,000 samples, giving 20,000 "production 
samples" with which to calculate the averages needed for the calculation of 
the gradients, eqs. (p~8|) and (|20|). 



4.3 Gradient ascent algorithm 

The numerical values for the gradient of the evidence, estimated as explained 
above, were used in a simple gradient ascent algorithm to move the hyperpa- 
rameters to a local maximum of the evidence. More powerful optimization 
techniques are not feasible in our case because neither the evidence itself, nor 
the Hessian of the evidence are available. The conjugate gradient method, for 
example, incorporates information about the Hessian and also usually em- 
ploys a line search using the values of the function to be optimized. Approx- 
imations to Newton's method such as the Levenberg-Marquardt algorithm 
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also use second derivative information. Fortunately, using the first derivatives 
of the evidence with respect to the hyperparameters leads to convergence to 
some maximum in a reasonable amount of time: typically between 40 and 
80 steps of gradient ascent are required before the gradients have shrunk to 
small values. 

For the experiments described the "learning rate" multiplier for the deriva- 
tive of each parameter is adapted separately throughout the optimization. 
This is necessary as the gradients vary over several orders of magnitude dur- 
ing a typical simulation. In our case the adaptation of the "learning rate" of 
the optimization must be based on the change in the gradients only rather 
than on the change in the evidence itself. We expect gradients to increase 
only at the start of a simulation, but thereafter they should decrease as the 
parameters approach a maximum in the evidence. If the gradients do not 
decline quickly then the learning rate is increased, if the gradients increase 
sharply then the ascent step is discarded and the learning rate is decreased. 
For vector parameters (like the length scales in an RBF kernel) the change in 
gradient diretion can also be used for learning rate adaptation: sudden and 
large changes in the gradient suggest that the optimization may have passed 
a maximum and the step should be redone with a smaller learning rate. As in 
the experiments with zero temperature Monte Carlo search, gradient ascent 
steps for the kernel hyperparameters were actually taken in the logarithms 
of these parameters. 

As noted above, the HMC simulation calculates averages over the poste- 
rior with only a relatively small amount of noise. Consequently, for a given 
set of starting hyperparameters an optimization based on gradient ascent in 
the evidence is practically deterministic. So in order to investigate the prop- 
erties of local maxima in the evidence repeated trials were performed with 
the SVM hyperparameters initialized to random values. A few preliminary 
trials were used to choose reasonable orders of magnitude, and unless spec- 
ified otherwise all results reported begin with uniform random initalization 
in the ranges C G [0.4,0.8], ln/ a G [—1,2] for all length scales, lnA; G [—1, 1] 
and In k g G [—2,-1]. 

4.4 Computational effort 

We conclude this section with a brief discussion of the computational de- 
mands of the various model selection methods; though we stress once more 
that our focus was not on computational efficiency, so that faster algorithms 
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can almost certainly be designed for all of the model selection criteria that 
we consider. 

The computationally cheapest of the simple model selection criteria is 
fgacv, which can be evaluated in time 0(n) from the properties of the trained 
SVM classifier. The span estimate e span requires the inversion of a matrix 
of size equal to the number of SVs; assuming that the number of SVs is 
some finite fraction of n for large n this gives a cost of 0(n 3 ) for large n. 
The Laplace approximations to the evidence, for both linear and quadratic 
penalty SVMs, are dominated by the evaluation of determinants whose size is 
also the number of SVs (or, for linear penalty SVMs, the number of marginal 
SVs), giving again a scaling of approximately 0(n 3 ). 

Table |2] lists the running times for a single optimization step with each of 
the different methods. The evidence approximations and the test error ap- 
proximations showed more or less similar running times, although the span 
estimate took somewhat longer on average. By far the greatest run time was 
needed for the gradient ascent on the evidence, due to the HMC sampling 
involved; a typical optimization run on a single processor HP V-Class took 
anywhere from 6 hours to 6 days. In comparison, most optimizations based 
on the simple model criteria were under an hour. The run time of the HMC 
algorithm should scale relatively benignly as 0(n 2 ) in the size of the training 
set, but our experiments show that the prefactor is large. The n 2 scaling 
comes mainly from the conversion from 6 to via 6 = K0 which is neces- 
sary during the solution of the Hamiltonian equations. (The length of the 
Hamiltonian trajectory, i.e. the time r, does not need to be increased with n; 
the same is true for the number of samples required to obtain the posterior 
averages to a given accuracy.) 

Note that the theoretical dependence of running time on training set size 
was not strictly followed in reality. One reason for this is that the average 
time per step presented includes time spent on discarded steps in the zero 
temperature Monte Carlo search algorithms. That is, the speed of the simple 
optimization techniques used here depends on the complexity of the search 
space. 

As stated above, we were interested in the evidence gradient ascent algo- 
rithm mainly as a baseline for SVM model selection based on probabilistic 
criteria. Computational efficiency could however be increased in a number of 
ways; the Nystrom method ||25|| , for example, could significantly reduce the 
dimensionality (currently n) of the space over which the posterior needs to 
be sampled using HMC. 
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Data set 


Inputs 


Training set size 


Test set size 


Crabs 


5 


80 


120 


Pima 


7 


200 


332 


WDBC 


30 


300 


269 


Twonorm 


20 


300 


7100 


Ringnorm 


20 


300 


7100 



Table 1: Number of input dimensions, and sizes of training and test sets for 
the data sets used in our experiments. 



5 Numerical results 

5.1 Data sets 

The model selection methods under consideration were applied to five two- 
class classification problems that are common in the machine learning lit- 
erature. Three of these are from real world problems: the Pima Indian 
Diabetes data set, the Crabs data set and the Wisconsin Diagnostic Breast 
Cancer (WDBC) data set. The remaining two data sets, Twonorm and Ring- 
norm, are synthetic. The dimensionality of the inputs x and the size of the 
training and test sets for each data set are given in Table III. All bench- 
mark data sets are available through the UCI Machine Learning Repository 
( |http: / /wwwl.ics.uci.edu/~mlearn/MLRepository.html| ) and/or the DELVE 



archive ( http: / /www.cs.toronto.edu/^delve/ ). More detailed descriptions 



are also available on the web. Inputs were standardized so that across each 
complete data set all input components had zero mean and unit variance. 
For each data set the training and test sets were held constant for all ex- 
periments. The first n points in the data set were used for training and the 
remaining points were used for testing. The one exception is the Crabs data 
set, where the 6th attribute (color) was not used for classification and the 
remaining points were sampled to ensure an even distribution of the unused 
color attribute in the training and test sets. The number of training points, 
given in Table [l], was the same as that used in previous research. 
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Data Set 


SVM 


LEI 


LE2 


GACV 


Span 


Evid grad 


Crabs 


0.81 


3 


5 


4 


6 


137 


Pima 


1.23 


10 


9 


11 


21 


1805 


WDBC 


2.1 


19 


26 


39 


64 


9352 


Twonorm 


2.5 


35 


26 


27 


82 


7779 


Ringnorm 


3.7 


58 


71 


68 


216 


10665 



Table 2: Average CPU time (seconds) per optimization step. Times are 
given for: training of the SVM classifier (SVM); evaluation of the Laplace 
approximation to evidence for p = 1 and p = 2 (LEI, LE2); evaluation of 
e g acv an d e sp an (GACV, Span); and evaluation of the evidence gradients (Evid 
grad) 

5.2 Model selection using simple criteria 

We discuss first the results obtained by optimizing the four simple model 
selection criteria: the Laplace evidence (LE), eq. (0), and the GACV ( |T3"D 
for linear penalty SVMs, and the Laplace evidence (|l7j) and span error es- 
timate ( |P2"D for quadratic penalty SVMs. The experiments with gradient 
ascent on the evidence, for linear penalty SVMs, are described separately in 
Sec. 

A typical example of selecting parameters for a linear penalty SVM by 
optimizing the Laplace approximation of the evidence is shown in Fig. |l|. The 
data set for the experiment shown is the Twonorm data set. This example is 
chosen because it shows several typical features that appear in similar forms 
in all of the optimizations. To what extent optimizations for the other data 
sets match this example will be noted where appropriate. 

The parameters all move more or less stochastically to stable final val- 
ues as the evidence is optimized and it is clear that maximizing the Laplace 
Evidence correlates to reducing the error on a test set. Both the Laplace 
evidence and the GACV are shown alongside the test error, although only 
the Laplace evidence is used for optimization. Maximizing the evidence gen- 
erally reduces the GACV, although this correspondence is not strict. Similar 
behaviour is observed for optimization with the Laplace evidence for the 
quadratic penalty SVM, and for optimization of the GACV and the span 
estimate. For quadratic penalty SVMs, the Laplace evidence and the span 
estimate also have the same qualitative correlation as the Laplace evidence 
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Figure 1: Hyperparameter tuning for the Two norm data set by optimizing 
the Laplace approximation to the evidence for linear penalty SVMs. The top 
four graphs show the evolution of the hyperparameters; 5 out of the 20 length 
scale parameters are shown. Below, the Laplace evidence (LE) is shown; the 
GACV error estimate and the test error are also displayed, to demonstrate 
the correlation with the Laplace evidence. 
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and the GACV for the linear penalty case. 

An important issue in all of these methods is the existence of many local 
optima in the model selection criteria. Starting from the same initialization, 
the hyperparameters converged to significantly different values in repeated 
trials. We verified explicitly, e.g. by evaluating the chosen model selection 
criterion along a line in hyperparameter space connecting different end points 
of two optimization runs, that the different local optima found were genuine 
and not artefacts due to incomplete convergence of the optimization algo- 
rithms. The search criteria always deteriorated in between the points found 
by the search, confirming that the latter were in fact local optima. 

To analyse the characteristics of the local optima, 25 repeated trials were 
performed on all data sets for the simple optimization criteria. Comparison 
of the final SVM hyperparameter values at the local optima showed that they 
were highly variable. For all methods and all data sets the variance of the final 
parameter values was always of the same order of magnitude as the average 
value of the final parameters. Tuning of the length scales is often interpreted 
as "relevance determination" for the different dimensions of the data because 
a large length scale indicates that the classification does not vary significantly 
with changes in that parameter. The results here however indicate that the 
relevance of each dimension probably depends in a complicated way on the 
relevance assigned to the other dimensions, and that different assignments of 
the length scales can yield similar results. 

In addition to variance in the final SVM hyperparameter values, the test 
error also showed significant trial to trial variation. For all methods, many 
of the trials result in a final test error that is close to the best achieved by 
any method, but for some methods a large portion of the trials end in a test 
error that is significantly worse. The average and standard deviation of the 
test errors achieved with the different methods are shown in Table [| Table [| 
shows the best test errors achieved on any trial for each method and data set. 
For reference, Table || shows the test errors achieved on the same data sets 
by comparable methods in previous research. Unfortunately, these previous 
studies do not always include error bars for test error results so it is hard to 
compare the results for the averages and standard deviations of the error. 

To illustrate the variability in the final error resulting from each optimiza- 
tion method histograms of the errors achieved in all trials are shown for the 
Twonorm, Pima and WDBC data sets in Figs. |2], |3|, and |4] respectively. These 
plots show the difficulty of picking a "best" method from among the simple 
model selection criteria. For the Twonorm data set all of the methods pro- 
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LEI 


LE2 


GACV 


Span 


Evid grad 


ES 


Crabs 


10.7 ± 2.1 


10.5 ± 1.3 


13.0 ± 1.8 


6.0 ± 1.8 


9.2 ± 1.5 


5.5 ± 1.3 


Pima 


30.3 ± 2.0 


33.5 ± 2.2 


23.2 ±2.7 


21.0 ± 1.1 


20.8 ± 1.5 


19.7 ± 1.5 


WDBC 


5.8 ± 2.5 


5.8 ±3.6 


9.6 ±2.8 


7.8 ±4.2 


4.0 ± 1.2 


2.4 ± 1.0 


Twonorm 


13.5 ± 12.6 


12.6 ± 5.0 


5.2 ± 1.9 


4.6 ± 0.9 


4.0 ± 0.2 


3.7 ±0.4 


Ringnorm 


4.7 ± 1.6 


2.5 ±5.3 


3.3 ± 1.3 


3.5 ± 1.3 


3.2 ±0.6 


3.2 ±0.6 



Table 3: Test error e for all data sets (in %), written in the form "mean 
± standard deviation". Statistics for the simple model selction criteria are 
taken over 25 trials. For gradient ascent in the evidence averages are over 
25 trials for the Crabs data set, and over 10 trials for all other data sets. 
Abbreviations for the model selection criteria are as in Table |2], except for 
the last column (ES = gradient ascent with "early stopping"; see Sec. |5.3.2|) . 



duce test errors that are around the best for any method in previous research, 
but with the evidence approximation for linear penalty SVMs around a third 
of the trials end in errors that are signficantly greater. For the Pima data 
set, all of the simple methods are inferior to the best methods in previous 
research, and both of the evidence approximations perform worse than the 
error estimates; while on the WDBC data set the evidence approximations 
are superior to the error estimates. 

Comparing Tables |3], ^, and |5|, it is clear that the high variability in the 
results achieved by optimizing the four "simple" model selection criteria is 
undesirable; while the best trials for each method and data set are approxi- 
mately the same as the best results reported in previous research, the average 
performance over trials is rather disappointing. One possible productive use 
of the high variability of classifiers produced by convergence to local optima 
of the model selection criteria could be to combine the resulting classifiers in 
some ensemble or voting scheme. Such approaches normally benefit precisely 
from high variability among the classifiers being combined, so this could be 
an interesting subject for future research. 

5.3 Model selection using evidence gradients 

Figs. [5] and |6| show a typical run of evidence gradient ascent on the Twonorm 
data set. Fig. || shows the tuning of a subset of the RBF kernel length scales 
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Figure 2: Histogram of test errors (in %) achieved on the Twonorm data set. 
Shown are the results of 25 trials with the simple model selection criteria, 
and 10 trials of evidence gradient ascent. The bin size for the histogram is 
2%. 
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Figure 3: Histogram of test errors (in %) achieved on the Pima data set. 
Shown are the results of 25 trials with the simple model selection criteria, 
and 10 trials of evidence gradient ascent. The bin size for the histogram is 
2%. 
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Figure 4: Histogram of test errors (in %) achieved on the WDBC data set. 
Shown are the results of 25 trials with the simple model selection criteria, 
and 10 trials of evidence gradient ascent. The bin size for the histogram is 
2%. 
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LEI 


LE2 


GACV 


Span 


Evid grad 


ES 


Crabs 


5.9 


9.2 


10.9 


3.4 


5.0 


3.4 


Pima 


27.8 


28.4 


20.2 


19.0 


19.3 


18.4 


WDBC 


1.9 


3.4 


4.1 


1.9 


1.5 


1.2 


Ringnorm 


2.1 


2.2 


1.9 


2.0 


2.5 


2.5 


Twonorm 


2.9 


3.1 


3.4 


3.5 


3.4 


3.0 



Table 4: Best single trial test error e (in %). Abbreviations for the model 
selection criteria are as in Table |^. 

and Fig. |6| shows the tuning of kernel amplitude k , the kernel offset k g 
and the penalty parameter C. (Although statistics for the performance of 
the evidence gradient method were determined by initialization to random 
parameter values, for the specific sample shown we started all length scales 
with identical parameters.) Both the gradients of the evidence with respect 
to each parameter and the parameter values themselves are shown. The 
gradients typically start at small values, rise to a peak and then decline. 
Most parameter ultimately arrive at a constant value with small gradients, 
indicating that the evidence is at a local maximum with respect to that 
parameter. The optimization is terminated when the gradients have reached 
a small fraction of their peak magnitude. During this process the error on 
the test set decreases significantly. 

As with the simple model selection criteria analysed in the previous sec- 
tion (Laplace approximations to the evidence and error approximations), 
repeated trials of gradient ascent in the evidence showed the existence of 
many local maxima in the evidence at widely varying parameter values. Due 
to the long run time required for the HMC sampling used to calculate the 
evidence gradients only 10 trials were performed for each benchmark data 
set, with the exception of the Crabs data set where 25 trials were performed. 
(See Section |4.4| for a discussion of the running time of the algorithm on the 
various data sets.) As explained in Sec. fO| the gradient ascent algorithm 
is essentially deterministic once the initial hyperparameter values have been 
fixed. Consequently, repeated trials were started from random initial values 
for the hyperparameters in order to investigate the existence and variability 
of local maxima in the evidence. 

Tables [3] and [| above list the resulting test errors obtained with gradient 
ascent optimization of the evidence, along with results obtained from the 
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Figure 5: Tuning the length scales l a on the Twonorm data set, using gradient 
ascent on the evidence. 6 out of 20 length scale parameters are shown, along 
with the corresponding gradients; the bottom plot shows the evolution of the 
test error. 
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Figure 6: Tuning k^, k Q g and C on the Twonorm data set, using gradient 
ascent on the evidence. The gradients for each parameter are shown alongside 
the actual parameter values. The two bottom panels show the evolution of 
the test error, with the right one being a zoom on the range of small error 



values; see discussion in Sec. 5.3.2 
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Data set 


GP Var 


SVM Var 


SVM CV 


GP Lap 


GP MF 


Crabs 


2.5 


3.3 


3.3 


3.3 


1.7 


Pima 


19.9 


20.5 


20.2 


20.2 


19.0 


WDBC 


3.7 


3.7 


3.3 


3.3 


2.6 


Twonorm 


3.2 


3.7 


2.3 


4.0 




Ringnorm 


1.7 


1.9 


2.3 


3.0 





Table 5: Test errors e (in %) found on the benchmark data sets in previous 
work. The methods used were as follows. GP Var: Gaussian process clas- 
sifier (see e.g. 0, 0), with hyperparameters determined by maximizing a 



variational approximation to the evidence |7J]. SVM Var: SVM with hyper- 
parameters selected by the same variational method |7|. SVM CV: SVM, 
with all length scales l a = I set equal and I and k determined by ten-fold 
cross validation |7|]; the offset was unrestricted, i.e. effectively k Q g — > oo. GP 
Lap: Gaussian process classifier, hyperparameters determined by maximiz- 
ing a Laplace approximation to the evidence J7J; GP MF: Gaussian process 
classifier trained by a mean field method [0]. 



simpler methods discussed earlier. Comparing with the results found in pre- 
vious studies (Table |), one sees that gradient ascent on the evidence for 
SVMs with radial basis function kernels achieves approximately the same test 
error as the best methods that have been previously applied. For some data 
sets the best performance obtained by evidence gradient ascent is superior 
to the performance previously reported. An interesting point of comparison 
is with SVM model selection by optimization of a variational approximation 
of the evidence, as described in [[/]]. (This comparison is somewhat tenta- 
tive because of the small number of trials for our evidence gradient ascent 
method, combined with the lack of information about trial-to-trial variation 
in [[7j.) Still, it is worth noting that although the method desribed here uses 
gradients of the evidence without further approximation, and also tunes the 
C parameter (which is effectively fixed to unity in the approach of ) , it does 
not seem to achieve systematically superior performance than the variational 
approximation. 

Figs. 0, |3] and |] above contain the histograms of the test error produced by 
SVM model selection by evidence gradient ascent, and the comparison with 
the simple model selection criteria, for the Twonorm, Pima and WDBC data 



31 



sets. Although strong conclusions cannot be drawn due to the small number 
of trials, gradient ascent in the evidence seems to produce significantly better 
performance on all of the data sets than any of the other methods. In all 
tests the distribution of resulting errors is both closer to the best results 
found in previous studies, and less variable. The most likely explanation for 
the superior performance of the gradient ascent method is the fact that it 
actually maximizes an exact, unapproximated model selection criterion (the 
evidence), while the simple model selection criteria (Laplace evidence and 
error estimates) are all to some extent approximate. The poorly performing 
local optima of these simple criteria may then arise from errors introduced 
by the approximations. 

We comment briefly on the actual values of the hyperparameters found 
by gradient ascent on the evidence, in particular for the kernel amplitude ko 
and the offset k D g. For the Twonorm data set, the maximum in the evidence 
occurs at a relatively large value of ko, with an average of ko w 180 across 
trials. (Large values of k were also found with model selection with the 
approximate evidence, but were much less common when using the error 
estimates.) This may appear surprising. However, it should be born in mind 
that from the probabilistic view the prior variance of the latent function 9 is 
(9 2 (x)) = K(x, x) = k + k o fi. The typical prior scale for 9(x) is therefore y^o 
(since k ^ is small, see below), which equates to around 13 for ko ~ 180; this 
is not unreasonably large compared to the scale of 1 set by the SVM margin. 
Similar final values of ko were obtained for the Crabs and WDBC data sets, 
while for Pima and Ringnorm ko was rather smaller. Previous experiments 
with simple synthetic data sets H suggest that an evidence maximum at 
large ko correlates with small apparent levels of noise in the data set; we 
have not attempted to verify this correlation for our five benchmark data 
sets. 

The offset hyperparameter k oS was typically tuned to very small values by 
evidence gradient ascent (e.g. around 0.03 for the Twonorm data set). This 
provides a posteriori justification for our approach of including the offset 
parameter b from the conventional SVM framework into the kernel. 

5.3.1 Noise in evidence gradients 

In the final portions of the optimization shown in Figs. [5] and || it can be 
observed that there is significant noise in the gradients as the evidence ap- 
proaches a maximum. This arises from statistical fluctuations in the HMC 



32 



sampling, which come to dominate when the true gradient values are small. 
Although the noise could be decreased by increasing the length of the HMC 
runs, that did not seem to be necessary for the cases considered here: because 
of the learning rate adaptation the learning rate is quite small by the time 
the evidence is close to its maximum and the noise in the gradients has little 
effect on the final results. 

In the Twonorm example it can also be seen that when the parameters 
are nearly at a maximum in the evidence the gradients with respect to the 
kernel parameters are calculated as zero in some steps. This effect occurred 
typically for larger values of C. Regions of 0-space where the potential V(0) 
in the Hamiltonian fl2"T|) is zero, i.e. where all > 1, are then much more 
probable then regions where yi$i < 1 for some i. It is then possible that the 
HMC sampling only returns samples from the region with V{6) = 0, where 
l' p {i)i6i) = for all i so that ( [Z0D gives an estimate of zero for all gradients 
with respect to kernel parameters. Experiments showed that scaling the 
trajectory length in the HMC runs proportionally to ^ for large C could 
avoid this effect. The rationale is that the shorter trajectories makes the 
HMC sampling more likely to sample values of which are just outside 
the boundary of the V{6) = region; these still have appreciable posterior 
probability but do give nonzero values for some of the l' p (yi9i). We did not 
explore this issue in detail, however. 

5.3.2 "Overfitting" by evidence maximization 

Close inspection of progress of the test error in Figs. |5| and ^| shows an 
interesting aspect of tuning SVM hyperparameters using the evidence. While 
the overall evolution of the test error shows a large decline as gradient ascent 
on the evidence progresses, a closer look at the region of small error values 
(see the lower right plot of Fig. |) shows that the test error goes through 
a shallow minimum before a small rise to its final value. Not all data sets 
show such a clean example of this behaviour as the Twonorm data set, but 
all except Pima did exhibit the phenomenon to some degree. 

One possible explanation for the observed test error minimum is the fact 
that we are not using the evidence of a properly normalized probabilty model 
(see Sec. §). An alternative interpretation, which seems to us more likely, 
is that we are observing here a kind of overfitting. This takes place not 
on the level of the "network" parameters (w or 9 (x)) as in conventional 
overfitting - which is due to a lack of regularization - but on the level of 
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the hyperparameters: If we imagine sampling a number of data sets of size 
n from a given true distribution, then the evidence as a function of the 
hyperparameters, and hence the position of its maximum, will depend on the 
particular data set. Only for large n would the evidence become independent 
of the data set (and related to the Kullback-Liebler divergence, or cross- 
entropy, between the true distribution over data sets and the one predicted by 
the inference model; see e.g. ||). For finite n, maximization of the evidence 
for a specific data set is therefore not expected to lead to strict minimization 
of the error on an independent test set. 

This interpretation leads naturally to the idea of using an early stopping 
mechanism when optimizing the evidence, where the gradient ascent is aban- 
doned when performance on an independent validation set ceases to improve. 
Note that this is not the same as simply returning to hyperparameter tuning 
by cross-validation; in fact, a grid search using cross-validation error over 
the large number of hyperparameters in our examples (C, ho, fc ff and the 
length scales l a associated with each of the d input dimensions) would be 



essentially impossible (see also |]zzf). To gauge the possible benefits of such 
an approach, we have included in Table |3| above both the final test error 
when the optimization is run until the gradients are small, and the minimal 
value of the test error during the gradient ascent. True early stopping with 
an independent validation set would be expected to yield a performance in 
between these two values; the results in Table |3| suggest that this could be 
useful for some data sets. 



6 Conclusion 

In this paper we have investigated the issue of model selection for SVM 
classifiers. We have restricted ourselves to model selection in the sense of 
tuning the parameters of an RBF kernel and the penalty parameter C, though 
the general approaches described could also be used for choosing between 
different functional forms of the kernel. 

We reviewed briefly the probabilistic view of SVMs, and extended our 
previous work on Laplace approximations to the evidence to the case of SVMs 
with quadratic slack penalties. Exact expressions for the gradients of the 
evidence in terms of posterior averages were also derived, and we described 
how these averages can be estimated numerically using Hybrid Monte Carlo 
techniques and used in a model selection algorithm which performs gradient 
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ascent on the exact (unapproximated) evidence. 

In our numerical experiments on five benchmark data sets, we compared 
optimization of four "simple" model criteria with the evidence gradient de- 
scent. Two of the simple criteria were estimates of test error: the generalized 
approximate cross-validation error (GACV) for SVMs with linear slack penal- 
ties, and the span error estimate for SVMs with quadratic penalties. The 
two other criteria were derived from probabilistic concepts; these were the 
Laplace approximations to the evidence for the linear and quadratic penalty 
cases. Our main result is that all the simple model criteria exhibit multiple lo- 
cal optima with respect to the hyperparameters. While some of the resulting 
"locally optimal" SVM classifiers give test performance that is competitive 
with results from other approaches in the literature, a significant fraction 
lead to rather higher test errors. The results for the evidence gradient ascent 
method show that also the exact evidence exhibits local optima. But these 
give much less variable test errors, which are also typically lower test errors 
than for the simpler model selection criteria. In this sense, "you get what 
you pay for": the computationally rather more expensive evidence gradi- 
ent ascent approach gives better and more consistent performance than the 
cheaper model selection criteria. 

There are a number of directions for possible future work. First, our re- 
sults strongly suggest that the hunt is still on for a model selection criteria for 
SVM classification which is both simple and gives consistent generalization 
performance. Alternatively, one could try to cope with the existence of local 
maxima in the simple model selection criteria by testing the selected models 
on a validation set and performing repeated optimizations until satisfactory 
performance is found. A more interesting approach might be to try to exploit 
the large variability in the locally optimal classifiers by using them in some 
scheme for combining classifiers. Finally, if evidence gradient ascent turned 
out in more comprehensive tests to be the model selection method of choice, 
it would be worth investigating possible speed-ups of the algorithm. We al- 
ready hinted at the Nystrom method [[25|] above, but one could also explore 



running the model selection only on randomly sampled subsets of data, and 
then possibly combining the resulting classifiers appropriately. 
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